Regression (People Income) Using a scikit MLPRegressor Neural Network

The scikit-learn library was originally designed for classical machine learning techniques such as logistic regression and naive Bayes classification. The library eventually added the ability to do binary and multi-class classification via the MLPClassifier (multi-layer perceptron) class, and regression via the MLPRegressor class. As best as I can determine by wading through the scikit change logs, these two classes were added in version 0.18 in late 2016.

I decided to take a look at regression using the scikit MLPRegressor class.

In my work environment, when I need to tackle a regression problem (i.e., predict a single numeric value such as a person's annual income), I use PyTorch. PyTorch is very complex, but it gives me the flexibility I need, and it can do much more sophisticated things than scikit, notably image classification, natural language processing, unsupervised anomaly detection, and Transformer architecture systems.

But scikit is easy to use and makes sense in some scenarios.

My data is synthetic and looks like:

 1   0.24   1 0 0   0.2950   0 0 1
-1   0.39   0 0 1   0.5120   0 1 0
 1   0.63   0 1 0   0.7580   1 0 0
-1   0.36   1 0 0   0.4450   0 1 0
 1   0.27   0 1 0   0.2860   0 0 1
. . .

There are 200 training items and 40 test items.

The first value, in column [0], is sex (M = -1, F = +1). Column [1] is age, normalized by dividing by 100. Columns [2,3,4] are the State of residence, one-hot encoded (Michigan = 100, Nebraska = 010, Oklahoma = 001). Column [5] is annual income, divided by $100,000; this is the value to predict. Columns [6,7,8] are political leaning (conservative = 100, moderate = 010, liberal = 001).
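
To make the encoding concrete, here is a minimal sketch of how one raw record maps to one encoded line. The encode_person() function and the dictionary names are illustrative, not part of the demo:

# minimal encoding sketch; names are illustrative
def encode_person(sex, age, state, income, politics):
  sex_enc = -1 if sex == "M" else 1  # M = -1, F = +1
  states = { "michigan" : [1,0,0],
    "nebraska" : [0,1,0], "oklahoma" : [0,0,1] }
  pols = { "conservative" : [1,0,0],
    "moderate" : [0,1,0], "liberal" : [0,0,1] }
  return [sex_enc, age / 100] + states[state] + \
    [income / 100_000] + pols[politics]

# encode_person("M", 39, "oklahoma", 51200, "moderate")
# returns [-1, 0.39, 0, 0, 1, 0.512, 0, 1, 0]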

Setting up a scikit MLPRegressor object is a bit daunting because there are a lot of parameters:

  params = { 'hidden_layer_sizes' : [10,10],
    'activation' : 'relu',
    'solver' : 'adam',
    'alpha' : 0.0,
    'batch_size' : 10,
    'random_state' : 0,
    'tol' : 0.0001,
    'nesterovs_momentum' : False,
    'learning_rate' : 'constant',
    'learning_rate_init' : 0.01,
    'max_iter' : 1000,
    'shuffle' : True,
    'n_iter_no_change' : 50,
    'verbose' : False }
       
  print("Creating 8-(10-10)-1 tanh neural network ")
  net = MLPRegressor(**params)
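
Most of the parameters have sensible defaults. One way to list every parameter and its current value is the get_params() method that all scikit estimators support:

from sklearn.neural_network import MLPRegressor
net = MLPRegressor()     # all default parameter values
print(net.get_params())  # dict of parameter-name : value pairs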

My demo implements a program-defined accuracy() function. Most scikit classifier classes have a score() function that gives simple accuracy, but for MLPRegressor, score() returns the coefficient of determination (R-squared). To get an accuracy metric for regression you must specify what a correct prediction is, for example, a predicted income within 10% of the true target income.
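
That said, the built-in score() is still handy as a quick sanity check. A minimal sketch, assuming the trained net and the data arrays from the demo below:

r2_train = net.score(train_x, train_y)  # R-squared
r2_test = net.score(test_x, test_y)
print("R^2 train = %0.4f  test = %0.4f" % (r2_train, r2_test))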

Good fun.



Predicting income is a difficult task with real data. For jobs that rely mostly on tips, such as golf course beverage cart driver, predicting income is especially difficult. I suspect the cart driver on the left makes more money from tips than the cart driver on the right.


Demo code.

# people_income_nn_scikit.py

# predict income  
# from sex, age, state, politics

# sex  age   state   income  politics
#  1  0.24  1  0  0  0.2950  0  0  1
# -1  0.39  0  0  1  0.5120  0  1  0
# state: michigan = 100, nebraska = 010, oklahoma = 001
# conservative = 100, moderate = 010, liberal = 001

# Anaconda3-2020.02  Python 3.7.6  scikit 0.22.1
# Windows 10/11

import numpy as np 
from sklearn.neural_network import MLPRegressor
import warnings
warnings.filterwarnings('ignore')  # early-stop warnings

# ---------------------------------------------------------

def accuracy(model, data_x, data_y, pct_close=0.10):
  # accuracy predicted within pct_close of actual income
  # item-by-item allows inspection but is slow
  n_correct = 0; n_wrong = 0
  predicteds = model.predict(data_x)  # all predicteds
  for i in range(len(predicteds)):
    actual = data_y[i]
    pred = predicteds[i]

    if np.abs(pred - actual) < np.abs(pct_close * actual):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# ---------------------------------------------------------

def accuracy_q(model, data_x, data_y, pct_close=0.10):
  # accuracy within pct_close of actual income
  # all-at-once is quick
  n_items = len(data_y)
  preds = model.predict(data_x)  # all predicteds
 
  n_correct = np.sum((np.abs(preds - data_y) < \
    np.abs(pct_close * data_y)))
  result = (n_correct / n_items) 
  return result 

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit neural network regression example ")
  print("Predict income from sex, age, State, politics ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,9),
    delimiter="\t", comments="#",  dtype=np.float32) 
  train_x = train_xy[:,[0,1,2,3,4,6,7,8]]
  train_y = train_xy[:,5]

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,9),
    delimiter="\t", comments="#",  dtype=np.float32) 
  test_x = test_xy[:,[0,1,2,3,4,6,7,8]]
  test_y = test_xy[:,5]

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

# ---------------------------------------------------------

  # 2. create network 
  # MLPRegressor(hidden_layer_sizes=(100,),
  #  activation='relu', *, solver='adam', alpha=0.0001,
  #  batch_size='auto', learning_rate='constant',
  #  learning_rate_init=0.001, power_t=0.5, max_iter=200,
  #  shuffle=True, random_state=None, tol=0.0001,
  #  verbose=False, warm_start=False, momentum=0.9,
  #  nesterovs_momentum=True, early_stopping=False,
  #  validation_fraction=0.1, beta_1=0.9, beta_2=0.999,
  #  epsilon=1e-08, n_iter_no_change=10, max_fun=15000)

  params = { 'hidden_layer_sizes' : [10,10],
    'activation' : 'relu',
    'solver' : 'adam',
    'alpha' : 0.0,
    'batch_size' : 10,
    'random_state' : 0,
    'tol' : 0.0001,
    'nesterovs_momentum' : False,
    'learning_rate' : 'constant',
    'learning_rate_init' : 0.01,
    'max_iter' : 1000,
    'shuffle' : True,
    'n_iter_no_change' : 50,
    'verbose' : False }
       
  print("\nCreating 8-(10-10)-1 relu neural network ")
  net = MLPRegressor(**params)
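  # parameters not listed in params keep their default values,
  # e.g., momentum=0.9, early_stopping=False (see docstring above)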

# ---------------------------------------------------------

  # 3. train
  print("\nTraining with bat sz = " + \
    str(params['batch_size']) + " lrn rate = " + \
    str(params['learning_rate_init']) + " ")
  print("Stop if no change " + \
    str(params['n_iter_no_change']) + " iterations ")
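  # fit() trains with the adam solver; training stops after
  # max_iter iterations, or earlier if the loss improves by less
  # than tol for n_iter_no_change consecutive iterations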
  net.fit(train_x, train_y)
  print("Done ")

# ---------------------------------------------------------

  # 4. evaluate model
  # score() is coefficient of determination for MLPRegressor
  print("\nCompute model accuracy (within 0.10 of actual) ")
  acc_train = accuracy(net, train_x, train_y, 0.10)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = accuracy(net, test_x, test_y, 0.10)
  print("Accuracy on test = %0.4f " % acc_test)

  # print("\nModel accuracy quick (within 0.10 of actual) ")
  # acc_train = accuracy_q(net, train_x, train_y, 0.10)
  # print("\nAccuracy on train = %0.4f " % acc_train)
  # acc_test = accuracy_q(net, test_x, test_y, 0.10)
  # print("Accuracy on test = %0.4f " % acc_test)
 
# ---------------------------------------------------------

  # 5. use model
  # no proba() for MLPRegressor
  print("\nSetting X = M 34 Oklahoma moderate: ")
  X = np.array([[-1, 0.34, 0,0,1,  0,1,0]])
  income = net.predict(X)[0]  # normalized: divided by 100,000
  income *= 100000  # denormalize
  print("Predicted income: %0.2f " % income)

# ---------------------------------------------------------
  
  # 6. TODO: save model using pickle
  # import pickle
  # print("Saving trained network ")
  # path = ".\\Models\\people_income_net.sav"
  # pickle.dump(net, open(path, "wb"))

  # use saved model
  # X = np.array([[-1, 0.34, 0,0,1,  0,1,0]],
  #   dtype=np.float32)
  # with open(path, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # inc = loaded_model.predict(X)
  # print(inc)

  print("\nEnd scikit binary neural network demo ")

if __name__ == "__main__":
  main()

Training data. Replace the commas with tabs, or modify the program to use a comma delimiter, as shown in the sketch below.
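
If you'd rather keep the commas, a minimal alternative (assuming the file layout above) is to change the delimiter argument in both np.loadtxt() calls:

# read comma-delimited data instead of tab-delimited
train_xy = np.loadtxt(train_file, usecols=range(0,9),
  delimiter=",", comments="#", dtype=np.float32)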

# people_train.txt
#
# sex (-1 = male, 1 = female), age / 100,
# state (michigan = 100, nebraska = 010, oklahoma = 001)
# income / 100_000,
# conservative = 100, moderate = 010, liberal = 001
#
1,0.24,1,0,0,0.2950,0,0,1
-1,0.39,0,0,1,0.5120,0,1,0
1,0.63,0,1,0,0.7580,1,0,0
-1,0.36,1,0,0,0.4450,0,1,0
1,0.27,0,1,0,0.2860,0,0,1
1,0.50,0,1,0,0.5650,0,1,0
1,0.50,0,0,1,0.5500,0,1,0
-1,0.19,0,0,1,0.3270,1,0,0
1,0.22,0,1,0,0.2770,0,1,0
-1,0.39,0,0,1,0.4710,0,0,1
1,0.34,1,0,0,0.3940,0,1,0
-1,0.22,1,0,0,0.3350,1,0,0
1,0.35,0,0,1,0.3520,0,0,1
-1,0.33,0,1,0,0.4640,0,1,0
1,0.45,0,1,0,0.5410,0,1,0
1,0.42,0,1,0,0.5070,0,1,0
-1,0.33,0,1,0,0.4680,0,1,0
1,0.25,0,0,1,0.3000,0,1,0
-1,0.31,0,1,0,0.4640,1,0,0
1,0.27,1,0,0,0.3250,0,0,1
1,0.48,1,0,0,0.5400,0,1,0
-1,0.64,0,1,0,0.7130,0,0,1
1,0.61,0,1,0,0.7240,1,0,0
1,0.54,0,0,1,0.6100,1,0,0
1,0.29,1,0,0,0.3630,1,0,0
1,0.50,0,0,1,0.5500,0,1,0
1,0.55,0,0,1,0.6250,1,0,0
1,0.40,1,0,0,0.5240,1,0,0
1,0.22,1,0,0,0.2360,0,0,1
1,0.68,0,1,0,0.7840,1,0,0
-1,0.60,1,0,0,0.7170,0,0,1
-1,0.34,0,0,1,0.4650,0,1,0
-1,0.25,0,0,1,0.3710,1,0,0
-1,0.31,0,1,0,0.4890,0,1,0
1,0.43,0,0,1,0.4800,0,1,0
1,0.58,0,1,0,0.6540,0,0,1
-1,0.55,0,1,0,0.6070,0,0,1
-1,0.43,0,1,0,0.5110,0,1,0
-1,0.43,0,0,1,0.5320,0,1,0
-1,0.21,1,0,0,0.3720,1,0,0
1,0.55,0,0,1,0.6460,1,0,0
1,0.64,0,1,0,0.7480,1,0,0
-1,0.41,1,0,0,0.5880,0,1,0
1,0.64,0,0,1,0.7270,1,0,0
-1,0.56,0,0,1,0.6660,0,0,1
1,0.31,0,0,1,0.3600,0,1,0
-1,0.65,0,0,1,0.7010,0,0,1
1,0.55,0,0,1,0.6430,1,0,0
-1,0.25,1,0,0,0.4030,1,0,0
1,0.46,0,0,1,0.5100,0,1,0
-1,0.36,1,0,0,0.5350,1,0,0
1,0.52,0,1,0,0.5810,0,1,0
1,0.61,0,0,1,0.6790,1,0,0
1,0.57,0,0,1,0.6570,1,0,0
-1,0.46,0,1,0,0.5260,0,1,0
-1,0.62,1,0,0,0.6680,0,0,1
1,0.55,0,0,1,0.6270,1,0,0
-1,0.22,0,0,1,0.2770,0,1,0
-1,0.50,1,0,0,0.6290,1,0,0
-1,0.32,0,1,0,0.4180,0,1,0
-1,0.21,0,0,1,0.3560,1,0,0
1,0.44,0,1,0,0.5200,0,1,0
1,0.46,0,1,0,0.5170,0,1,0
1,0.62,0,1,0,0.6970,1,0,0
1,0.57,0,1,0,0.6640,1,0,0
-1,0.67,0,0,1,0.7580,0,0,1
1,0.29,1,0,0,0.3430,0,0,1
1,0.53,1,0,0,0.6010,1,0,0
-1,0.44,1,0,0,0.5480,0,1,0
1,0.46,0,1,0,0.5230,0,1,0
-1,0.20,0,1,0,0.3010,0,1,0
-1,0.38,1,0,0,0.5350,0,1,0
1,0.50,0,1,0,0.5860,0,1,0
1,0.33,0,1,0,0.4250,0,1,0
-1,0.33,0,1,0,0.3930,0,1,0
1,0.26,0,1,0,0.4040,1,0,0
1,0.58,1,0,0,0.7070,1,0,0
1,0.43,0,0,1,0.4800,0,1,0
-1,0.46,1,0,0,0.6440,1,0,0
1,0.60,1,0,0,0.7170,1,0,0
-1,0.42,1,0,0,0.4890,0,1,0
-1,0.56,0,0,1,0.5640,0,0,1
-1,0.62,0,1,0,0.6630,0,0,1
-1,0.50,1,0,0,0.6480,0,1,0
1,0.47,0,0,1,0.5200,0,1,0
-1,0.67,0,1,0,0.8040,0,0,1
-1,0.40,0,0,1,0.5040,0,1,0
1,0.42,0,1,0,0.4840,0,1,0
1,0.64,1,0,0,0.7200,1,0,0
-1,0.47,1,0,0,0.5870,0,0,1
1,0.45,0,1,0,0.5280,0,1,0
-1,0.25,0,0,1,0.4090,1,0,0
1,0.38,1,0,0,0.4840,1,0,0
1,0.55,0,0,1,0.6000,0,1,0
-1,0.44,1,0,0,0.6060,0,1,0
1,0.33,1,0,0,0.4100,0,1,0
1,0.34,0,0,1,0.3900,0,1,0
1,0.27,0,1,0,0.3370,0,0,1
1,0.32,0,1,0,0.4070,0,1,0
1,0.42,0,0,1,0.4700,0,1,0
-1,0.24,0,0,1,0.4030,1,0,0
1,0.42,0,1,0,0.5030,0,1,0
1,0.25,0,0,1,0.2800,0,0,1
1,0.51,0,1,0,0.5800,0,1,0
-1,0.55,0,1,0,0.6350,0,0,1
1,0.44,1,0,0,0.4780,0,0,1
-1,0.18,1,0,0,0.3980,1,0,0
-1,0.67,0,1,0,0.7160,0,0,1
1,0.45,0,0,1,0.5000,0,1,0
1,0.48,1,0,0,0.5580,0,1,0
-1,0.25,0,1,0,0.3900,0,1,0
-1,0.67,1,0,0,0.7830,0,1,0
1,0.37,0,0,1,0.4200,0,1,0
-1,0.32,1,0,0,0.4270,0,1,0
1,0.48,1,0,0,0.5700,0,1,0
-1,0.66,0,0,1,0.7500,0,0,1
1,0.61,1,0,0,0.7000,1,0,0
-1,0.58,0,0,1,0.6890,0,1,0
1,0.19,1,0,0,0.2400,0,0,1
1,0.38,0,0,1,0.4300,0,1,0
-1,0.27,1,0,0,0.3640,0,1,0
1,0.42,1,0,0,0.4800,0,1,0
1,0.60,1,0,0,0.7130,1,0,0
-1,0.27,0,0,1,0.3480,1,0,0
1,0.29,0,1,0,0.3710,1,0,0
-1,0.43,1,0,0,0.5670,0,1,0
1,0.48,1,0,0,0.5670,0,1,0
1,0.27,0,0,1,0.2940,0,0,1
-1,0.44,1,0,0,0.5520,1,0,0
1,0.23,0,1,0,0.2630,0,0,1
-1,0.36,0,1,0,0.5300,0,0,1
1,0.64,0,0,1,0.7250,1,0,0
1,0.29,0,0,1,0.3000,0,0,1
-1,0.33,1,0,0,0.4930,0,1,0
-1,0.66,0,1,0,0.7500,0,0,1
-1,0.21,0,0,1,0.3430,1,0,0
1,0.27,1,0,0,0.3270,0,0,1
1,0.29,1,0,0,0.3180,0,0,1
-1,0.31,1,0,0,0.4860,0,1,0
1,0.36,0,0,1,0.4100,0,1,0
1,0.49,0,1,0,0.5570,0,1,0
-1,0.28,1,0,0,0.3840,1,0,0
-1,0.43,0,0,1,0.5660,0,1,0
-1,0.46,0,1,0,0.5880,0,1,0
1,0.57,1,0,0,0.6980,1,0,0
-1,0.52,0,0,1,0.5940,0,1,0
-1,0.31,0,0,1,0.4350,0,1,0
-1,0.55,1,0,0,0.6200,0,0,1
1,0.50,1,0,0,0.5640,0,1,0
1,0.48,0,1,0,0.5590,0,1,0
-1,0.22,0,0,1,0.3450,1,0,0
1,0.59,0,0,1,0.6670,1,0,0
1,0.34,1,0,0,0.4280,0,0,1
-1,0.64,1,0,0,0.7720,0,0,1
1,0.29,0,0,1,0.3350,0,0,1
-1,0.34,0,1,0,0.4320,0,1,0
-1,0.61,1,0,0,0.7500,0,0,1
1,0.64,0,0,1,0.7110,1,0,0
-1,0.29,1,0,0,0.4130,1,0,0
1,0.63,0,1,0,0.7060,1,0,0
-1,0.29,0,1,0,0.4000,1,0,0
-1,0.51,1,0,0,0.6270,0,1,0
-1,0.24,0,0,1,0.3770,1,0,0
1,0.48,0,1,0,0.5750,0,1,0
1,0.18,1,0,0,0.2740,1,0,0
1,0.18,1,0,0,0.2030,0,0,1
1,0.33,0,1,0,0.3820,0,0,1
-1,0.20,0,0,1,0.3480,1,0,0
1,0.29,0,0,1,0.3300,0,0,1
-1,0.44,0,0,1,0.6300,1,0,0
-1,0.65,0,0,1,0.8180,1,0,0
-1,0.56,1,0,0,0.6370,0,0,1
-1,0.52,0,0,1,0.5840,0,1,0
-1,0.29,0,1,0,0.4860,1,0,0
-1,0.47,0,1,0,0.5890,0,1,0
1,0.68,1,0,0,0.7260,0,0,1
1,0.31,0,0,1,0.3600,0,1,0
1,0.61,0,1,0,0.6250,0,0,1
1,0.19,0,1,0,0.2150,0,0,1
1,0.38,0,0,1,0.4300,0,1,0
-1,0.26,1,0,0,0.4230,1,0,0
1,0.61,0,1,0,0.6740,1,0,0
1,0.40,1,0,0,0.4650,0,1,0
-1,0.49,1,0,0,0.6520,0,1,0
1,0.56,1,0,0,0.6750,1,0,0
-1,0.48,0,1,0,0.6600,0,1,0
1,0.52,1,0,0,0.5630,0,0,1
-1,0.18,1,0,0,0.2980,1,0,0
-1,0.56,0,0,1,0.5930,0,0,1
-1,0.52,0,1,0,0.6440,0,1,0
-1,0.18,0,1,0,0.2860,0,1,0
-1,0.58,1,0,0,0.6620,0,0,1
-1,0.39,0,1,0,0.5510,0,1,0
-1,0.46,1,0,0,0.6290,0,1,0
-1,0.40,0,1,0,0.4620,0,1,0
-1,0.60,1,0,0,0.7270,0,0,1
1,0.36,0,1,0,0.4070,0,0,1
1,0.44,1,0,0,0.5230,0,1,0
1,0.28,1,0,0,0.3130,0,0,1
1,0.54,0,0,1,0.6260,1,0,0

Test data.

# people_test.txt
#
-1,0.51,1,0,0,0.6120,0,1,0
-1,0.32,0,1,0,0.4610,0,1,0
1,0.55,1,0,0,0.6270,1,0,0
1,0.25,0,0,1,0.2620,0,0,1
1,0.33,0,0,1,0.3730,0,0,1
-1,0.29,0,1,0,0.4620,1,0,0
1,0.65,1,0,0,0.7270,1,0,0
-1,0.43,0,1,0,0.5140,0,1,0
-1,0.54,0,1,0,0.6480,0,0,1
1,0.61,0,1,0,0.7270,1,0,0
1,0.52,0,1,0,0.6360,1,0,0
1,0.3,0,1,0,0.3350,0,0,1
1,0.29,1,0,0,0.3140,0,0,1
-1,0.47,0,0,1,0.5940,0,1,0
1,0.39,0,1,0,0.4780,0,1,0
1,0.47,0,0,1,0.5200,0,1,0
-1,0.49,1,0,0,0.5860,0,1,0
-1,0.63,0,0,1,0.6740,0,0,1
-1,0.3,1,0,0,0.3920,1,0,0
-1,0.61,0,0,1,0.6960,0,0,1
-1,0.47,0,0,1,0.5870,0,1,0
1,0.3,0,0,1,0.3450,0,0,1
-1,0.51,0,0,1,0.5800,0,1,0
-1,0.24,1,0,0,0.3880,0,1,0
-1,0.49,1,0,0,0.6450,0,1,0
1,0.66,0,0,1,0.7450,1,0,0
-1,0.65,1,0,0,0.7690,1,0,0
-1,0.46,0,1,0,0.5800,1,0,0
-1,0.45,0,0,1,0.5180,0,1,0
-1,0.47,1,0,0,0.6360,1,0,0
-1,0.29,1,0,0,0.4480,1,0,0
-1,0.57,0,0,1,0.6930,0,0,1
-1,0.2,1,0,0,0.2870,0,0,1
-1,0.35,1,0,0,0.4340,0,1,0
-1,0.61,0,0,1,0.6700,0,0,1
-1,0.31,0,0,1,0.3730,0,1,0
1,0.18,1,0,0,0.2080,0,0,1
1,0.26,0,0,1,0.2920,0,0,1
-1,0.28,1,0,0,0.3640,0,0,1
-1,0.59,0,0,1,0.6940,0,0,1